Discourse Parsing: A Decision Tree Approach
Abstract
The paper presents a new statistical method for parsing discourse. A parse of discourse is defined as a set of semantic dependencies among sentences that make up the discourse. A collection of news articles from a Japanese economics daily is manually marked for dependency and used as a training/testing corpus. We use a C4.5 decision tree method to develop a model of sentential dependencies. However, rather than use class decisions made by C4.5, we exploit information on class distributions to rank possible dependencies among sentences according to their probabilistic strength and take a parse to be a set of highest ranking dependencies. We also study effects of features such as clue words, distance and similarity on the performance of the discourse parser. Experiments have found that the method performs reasonably well on diverse text types, scoring an accuracy rate of over 60%.

1 Introduction

Attempts at the automatic identification of structure in discourse have so far met with limited success in the computational linguistics literature. Part of the reason is that, compared to the sizable data resources available to parsing research such as the Penn Treebank (Marcus et al., 1993), large corpora annotated for discourse information are hard to come by. Researchers in discourse usually work with a corpus of a few hundred sentences (Kurohashi and Nagao, 1994; Litman and Passonneau, 1995; Hearst, 1994). The lack of a large-scale corpus has made it impossible to talk about results of discourse studies with a sufficient degree of reliability. In the work described here, we created a corpus with discourse information, containing 645 articles from a Japanese economic paper, an order of magnitude larger than any previous work on discourse processing. It had a total of 12,770 sentences and 5,352 paragraphs. Each article in the corpus was manually annotated for a discourse dependency relation.
We then built a statistical discourse parser based on the C4.5 decision tree method (Quinlan, 1993), which was trained and tested on the corpus we have created. The design of the parser was inspired by Haruno (1997)'s work on statistical sentence parsing.

The paper is organized as follows. Section 2 presents general ideas about statistical parsing as applied to discourse. After a brief introduction to some of the points of a decision tree model, we discuss incorporating a decision tree within a statistical parsing model. In Section 3, we explain how we have built an annotated corpus. There we also describe the procedure of the experiments we have conducted, and conclude the section with their results.

Figure 1: A discourse tree. 'S' denotes a sentence.

2 Statistical Discourse Parsing

First, let us make clear what we mean by parsing a discourse. The job of parsing is to find whatever dependencies there are among the elements that make up a particular linguistic unit. In discourse parsing, the elements among which one is interested in finding dependencies correspond to sentences, and the level of unit under investigation is a discourse. We take a naive approach to the notion of a dependency here. We think of it as a relationship between a pair of sentences such that the interpretation of one sentence in some way depends on that of the other. Thus a dependency relationship is not a structural one, but rather a semantic or rhetorical one. The job of a discourse parser is to take as input...
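The ranking idea described above, using class distributions rather than hard class decisions to pick the strongest dependencies, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the leaf counts, the add-one smoothing, the names `leaf_probability` and `parse_discourse`, and the assumption that every sentence after the first attaches to exactly one earlier sentence are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical leaf statistics: (dependent, not_dependent) training counts
# found at the decision-tree leaf that a sentence pair falls into.
LeafCounts = Tuple[int, int]


def leaf_probability(counts: LeafCounts) -> float:
    """Turn the class counts at a leaf into P(dependent).

    Add-one smoothing is an assumption made here so that empty or
    one-sided leaves never yield exactly 0 or 1.
    """
    yes, no = counts
    return (yes + 1) / (yes + no + 2)


def parse_discourse(
    n_sentences: int,
    score: Callable[[int, int], float],
) -> List[Tuple[int, int]]:
    """Attach each sentence j > 0 to the earlier sentence i with the top score.

    Returns a list of (head, dependent) index pairs, one per sentence
    after the first (assumed single-head constraint).
    """
    parse = []
    for j in range(1, n_sentences):
        head = max(range(j), key=lambda i: score(i, j))
        parse.append((head, j))
    return parse


# Toy leaf counts for every (earlier, later) sentence pair in a
# four-sentence text; the numbers are invented for illustration.
toy_counts: Dict[Tuple[int, int], LeafCounts] = {
    (0, 1): (8, 2),
    (0, 2): (3, 7),
    (1, 2): (6, 4),
    (0, 3): (5, 5),
    (1, 3): (1, 9),
    (2, 3): (2, 8),
}


def toy_score(i: int, j: int) -> float:
    """Probability that sentence j depends on sentence i."""
    return leaf_probability(toy_counts[(i, j)])


parse = parse_discourse(4, toy_score)
print(parse)
```

In this toy run, sentence 2 attaches to sentence 1 rather than sentence 0 because its leaf assigns a higher smoothed probability to the "dependent" class, even though a hard class decision alone would not rank the two candidates.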
Similar resources
An effective Discourse Parser that uses Rich Linguistic Information
This paper presents a first-order logic learning approach to determine rhetorical relations between discourse segments. Beyond linguistic cues and lexical information, our approach exploits compositional semantics and segment discourse structure data. We report a statistically significant improvement in classifying relations over attribute-value learning paradigms such as Decision Trees, RIPPER...
Hierarchical Discourse Parsing Based on Similarity Metrics
Attentional State Theory and Rhetorical Structure Theory are two predominant theories of discourse parsing. Combining these two approaches, in this paper, we describe a novel approach for discourse parsing. The resulting discourse tree structure retains following properties: structure of purpose from Attentional State Theory and relations between sentences from Rhetorical Structure Theory. We d...
Text-level Discourse Dependency Parsing
Previous researches on Text-level discourse parsing mainly made use of constituency structure to parse the whole document into one discourse tree. In this paper, we present the limitations of constituency based discourse parsing and first propose to use dependency structure to directly represent the relations between elementary discourse units (EDUs). The state-of-the-art dependency parsing tec...
Designing a Discourse Parser for the Evaluative Text Genre
We propose designing a discourse parser specifically for the evaluative text genre. We aim to see whether focusing on a certain genre and relations specific to that genre offers performance gain beyond more generic discourse parsers. In this extended abstract we describe the approach we intend to take, and how this differs from what has been done previously. The problem of discourse parsing It ...
Discriminative Reranking of Discourse Parses Using Tree Kernels
In this paper, we present a discriminative approach for reranking discourse trees generated by an existing probabilistic discourse parser. The reranker relies on tree kernels (TKs) to capture the global dependencies between discourse units in a tree. In particular, we design new computational structures of discourse trees, which combined with standard TKs, originate novel discourse TKs. The emp...
Publication date: 1998